The intent of this file is to sintetize EDA approaches learned during Udacity’s Nanodegree in R language. For the analysis, a database from a Chicago bike sharing company will be used.
First, let’s download all libraries that will be used, as well as the bike sharing database.
# Data visualization
#install.packages('ggplot2')
library('ggplot2')
#install.packages("knitr")
#install.packages('plotly')
library('plotly')
#install.packages('gridExtra')
library('gridExtra')
# Dat manipulation
#install.packages('dplyr')
library('dplyr')
#install.packages("tidyr")
library('tidyr')## [1] "data.frame"
Let’s check how many rows and columns our dataset have, as well as check the summary of all the variables:
## [1] 1551505 8
## Start.Time End.Time
## 2017-04-15 13:27:10: 7 2017-05-21 13:07:49: 8
## 2017-05-29 15:19:36: 7 2017-06-10 13:31:26: 8
## 2017-04-15 13:02:35: 6 2017-02-18 14:24:00: 7
## 2017-04-26 16:07:17: 6 2017-06-02 17:16:46: 7
## 2017-05-15 17:14:43: 6 2017-06-03 14:32:22: 7
## 2017-05-16 17:56:13: 6 2017-06-10 18:47:17: 7
## (Other) :1551467 (Other) :1551461
## Trip.Duration Start.Station
## Min. : 60.0 Streeter Dr & Grand Ave : 36686
## 1st Qu.: 392.0 Clinton St & Washington Blvd: 22429
## Median : 670.0 Lake Shore Dr & Monroe St : 21822
## Mean : 939.8 Clinton St & Madison St : 19098
## 3rd Qu.: 1127.0 Canal St & Adams St : 18105
## Max. :86338.0 Lake Shore Dr & North Blvd : 17729
## (Other) :1415636
## End.Station User.Type
## Streeter Dr & Grand Ave : 39537 Customer : 317162
## Clinton St & Washington Blvd: 21113 Dependent : 4
## Clinton St & Madison St : 20969 Subscriber:1234339
## Lake Shore Dr & Monroe St : 20652
## Lake Shore Dr & North Blvd : 19734
## Theater on the Lake : 18529
## (Other) :1410971
## Gender Birth.Year
## :316867 Min. :1899
## Female:298784 1st Qu.:1975
## Male :935854 Median :1984
## Mean :1981
## 3rd Qu.:1989
## Max. :2016
## NA's :316683
Our dataframe has around 1,5M rows and 8 columns. The variables goes as following:
Let’s take a look at the first 20 rows of the dataset:
## Start.Time End.Time Trip.Duration
## 1 2017-01-01 00:00:36 2017-01-01 00:06:32 356
## 2 2017-01-01 00:02:54 2017-01-01 00:08:21 327
## 3 2017-01-01 00:06:06 2017-01-01 00:18:31 745
## 4 2017-01-01 00:07:28 2017-01-01 00:12:51 323
## 5 2017-01-01 00:07:57 2017-01-01 00:20:53 776
## 6 2017-01-01 00:10:44 2017-01-01 00:21:27 643
## 7 2017-01-01 00:11:34 2017-01-01 00:23:47 733
## 8 2017-01-01 00:14:57 2017-01-01 00:26:22 685
## 9 2017-01-01 00:15:03 2017-01-01 00:26:28 685
## 10 2017-01-01 00:17:01 2017-01-01 00:29:49 768
## 11 2017-01-01 00:17:13 2017-01-01 11:03:34 38781
## 12 2017-01-01 00:18:28 2017-01-01 00:31:05 757
## 13 2017-01-01 00:18:50 2017-01-01 00:21:47 177
## 14 2017-01-01 00:23:41 2017-01-01 00:29:13 332
## 15 2017-01-01 00:25:47 2017-01-01 00:39:53 846
## 16 2017-01-01 00:25:47 2017-01-01 00:43:23 1056
## 17 2017-01-01 00:26:21 2017-01-01 00:39:40 799
## 18 2017-01-01 00:27:21 2017-01-01 00:42:59 938
## 19 2017-01-01 00:27:28 2017-01-01 00:42:44 916
## 20 2017-01-01 00:27:45 2017-01-01 00:31:13 208
## Start.Station End.Station
## 1 Canal St & Taylor St Canal St & Monroe St (*)
## 2 Larrabee St & Menomonee St Sheffield Ave & Kingsbury St
## 3 Orleans St & Chestnut St (NEXT Apts) Ashland Ave & Blackhawk St
## 4 Franklin St & Monroe St Clinton St & Tilden St
## 5 Broadway & Barry Ave Sedgwick St & North Ave
## 6 State St & Kinzie St Wells St & Polk St
## 7 Wabash Ave & Wacker Pl Clinton St & Tilden St
## 8 Daley Center Plaza Canal St & Monroe St (*)
## 9 Daley Center Plaza Canal St & Monroe St (*)
## 10 Dayton St & North Ave Ogden Ave & Chicago Ave
## 11 Wilton Ave & Diversey Pkwy Halsted St & Wrightwood Ave
## 12 Canal St & Madison St LaSalle St & Illinois St
## 13 Theater on the Lake Lakeview Ave & Fullerton Pkwy
## 14 Halsted St & Maxwell St Halsted St & 18th St
## 15 Ravenswood Ave & Lawrence Ave Clarendon Ave & Gordon Ter
## 16 Clark St & Congress Pkwy Wolcott Ave & Polk St
## 17 Ravenswood Ave & Lawrence Ave Clarendon Ave & Gordon Ter
## 18 Millennium Park Michigan Ave & 18th St
## 19 Millennium Park Michigan Ave & 18th St
## 20 Damen Ave & Chicago Ave Damen Ave & Division St
## User.Type Gender Birth.Year
## 1 Customer NA
## 2 Subscriber Male 1984
## 3 Subscriber Male 1985
## 4 Subscriber Male 1990
## 5 Subscriber Male 1990
## 6 Subscriber Male 1970
## 7 Subscriber Male 1986
## 8 Customer NA
## 9 Customer NA
## 10 Customer NA
## 11 Subscriber Female 1988
## 12 Customer NA
## 13 Subscriber Male 1991
## 14 Subscriber Male 1984
## 15 Subscriber Female 1987
## 16 Subscriber Male 1984
## 17 Subscriber Male 1987
## 18 Subscriber Male 1991
## 19 Subscriber Female 1990
## 20 Subscriber Male 1982
How are the rides distributed along each day of the week? Is there a difference between weekdays and weekends?
df$day <- weekdays(as.Date(df$Start.Time))
grouped_per_weekday <- df %>%
group_by(day) %>%
summarise(count_trip_duration = round(length(Trip.Duration),0))
ggplot(grouped_per_weekday, aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')),
y = count_trip_duration)) + geom_col() +
geom_text(aes(label=count_trip_duration), position=position_dodge(width=0.9), vjust=-0.25) + ggtitle("Count of Trips per Weekday") + xlab("Number of Trips") + ylab("Weekday")grouped_per_weekday <- df %>%
group_by(day) %>%
summarise(mean_trip_duration = round(mean(Trip.Duration),0))
ggplot(grouped_per_weekday, aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')),
y = mean_trip_duration)) + geom_col() +
geom_text(aes(label=mean_trip_duration), position=position_dodge(width=0.9), vjust=-0.25) + ggtitle("Average Trip Duration (in seconds) per Weekday") + xlab("Average Trip Duration") + ylab("Weekday")Although there are slightly fewer bike trips during the weekend, the trip duration are considerably higher when compared to week days.
Since the trip duration data is over-dispersed, let’s apply a logarithm and a square root transformation on it.
p1 <- ggplot(aes(x = Trip.Duration), data = df) + geom_histogram() + ggtitle("Histogram for Trip Duration")
p2 <- p1 + scale_x_log10() + ggtitle("Histogram for Trip Duration with Log10 normalization")
p3 <- p1 + scale_x_sqrt() + ggtitle("Histogram for Trip Duration with Sqrt normalization")
grid.arrange(p1,p2,p3,ncol=1)Let’s check how many male riders we have:
## # A tibble: 1 x 1
## n
## <int>
## 1 935854
What about female riders?
## # A tibble: 1 x 1
## n
## <int>
## 1 298784
Let’s check how each Gender compare considering their average trip duration:
## df$Gender:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60 927 1379 1875 1962 86338
## --------------------------------------------------------
## df$Gender: Female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 398.8 648.0 783.7 1010.0 85782.0
## --------------------------------------------------------
## df$Gender: Male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 338.0 541.0 673.1 861.0 86096.0
Although there’s significantly more male riders than female ones, the female have longer trip durations on average. We can check how they compare using a frequency polygon:
ggplot(aes(x = Trip.Duration), data = subset(df, Gender %in% c('Male','Female'))) +
geom_freqpoly(aes(color = Gender)) +
scale_x_log10() + ggtitle("Frequency Polygon Comparing Gender based on Trip Duration")We can see that the female normal distribution is slightly more left-skewed than the male curve, validating what we saw earlier.
## Warning: Removed 316683 rows containing non-finite values (stat_bin).
Most users birth year falls between 1950 and 2000, with a peak around 1980-1990.
Distribution of types of user:
grouped_users <- df %>%
group_by(User.Type) %>%
summarise(count_trips = length(Trip.Duration))
ggplot(grouped_users, aes(User.Type,count_trips)) + geom_col() +
geom_text(aes(label=count_trips), position=position_dodge(width=0.9), vjust=-0.25) + ggtitle("Count of each User Type")We can see that most of the riders are actual subscribers.
Using the plotly package, we can creat the trip duration per birth year of users, also showing the start and end station. Since the chart is heavy for the amout of data, we will use a 10k sample of the data.
set.seed(1)
sample_df <- df[sample(nrow(df), 10000), ]
plot_ly(data = sample_df, x = sample_df$Trip.Duration , y = sample_df$Birth.Year,
marker = list(size = 10,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2)),
text = ~paste("Start Station: ", sample_df$Start.Station, '$<br>Finish Station:', sample_df$End.Station)
) %>%
layout(title = 'Trip duration distribution',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))## Warning: Ignoring 2053 observations
We can see that the majority of trips stay under the 10k seconds threshold and user age doesn’t really seem to have an impact on it. The ride with most duration departed from Clinton st and lasted for more than 74K seconds (or more than 20 hours!). It’s probably a user who forgot to return the bike =D
Since the Female riders seems to have longer trips on average on age hasn’t prove to have much different on trip duration, let’s try grouping our data by age and gender and check the average trip duration for the group:
grouped_df <- df %>%
group_by(Birth.Year,Gender) %>%
summarise(mean_trip_duration = mean(Trip.Duration)) %>%
ungroup() %>%
arrange(Birth.Year)
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
data = subset(grouped_df, Gender %in% c('Male','Female'))) +
geom_line(aes(color = Gender), stat = 'summary', fun.y = median) + ggtitle("Mean Trip Duration per Gender and Distributed by Birth Year")For the birthday range where we have most of the datapoints (between 1950 and 2000), we can see that Female riders consistently have a higher trip duration than Male riders.
Let’s check the same analysis, but for customers x subscribers:
grouped_df <- df %>%
group_by(Birth.Year,User.Type) %>%
summarise(mean_trip_duration = mean(Trip.Duration)) %>%
ungroup() %>%
arrange(Birth.Year)
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
data = subset(grouped_df, User.Type %in% c('Customer','Subscriber'))) +
geom_line(aes(color = User.Type), stat = 'summary', fun.y = median) + ggtitle("Mean Trip Duration per User Type and Distributed by Birth Year")## Warning: Removed 1 rows containing non-finite values (stat_summary).
Customers have longer trip than subscribers. Could that be a consequence of more female riders as customers than subscribers?
Finally, let’s check which are the 5 most popular stations on both starting and finishing rides:
grouped_start <- df %>%
group_by(Start.Station) %>%
summarise (no_rows = length(Start.Station))
grouped_start <- grouped_start [order (-grouped_start$no_rows),] %>%
mutate (Pareto = cumsum(grouped_start$no_rows/sum(grouped_start$no_rows)))
grouped_start <- head(grouped_start,5)
grouped_end <- df %>%
group_by(End.Station) %>%
summarise (no_rows = length(End.Station))
grouped_end <- grouped_end [order (-grouped_end$no_rows),] %>%
mutate (Pareto = cumsum(grouped_end$no_rows/sum(grouped_end$no_rows)))
grouped_end <- head(grouped_end,5)
end_point = 0.5 + nrow(grouped_start) + nrow(grouped_start)-1
barplot(grouped_start$no_rows,
width = 1, space = 1.0, border = NA, axes = F,
ylim = c(0, 1.05 * max(grouped_start$no_rows, na.rm = T)),
ylab = "Cummulative Counts" , cex.names = 0.7,
main = "5 Most Popular Start Stations")
text(seq(1.5,end_point,by=2), par("usr")[3]-0.25,
srt = 45, adj= 1, xpd = TRUE,labels = paste(grouped_start$Start.Station), cex=0.65)barplot(grouped_end$no_rows,
width = 1, space = 1.0, border = NA, axes = F,
ylim = c(0, 1.05 * max(grouped_end$no_rows, na.rm = T)),
ylab = "Cummulative Counts" , cex.names = 0.7, las=2,
main = "5 Most Popular End Stations")
text(seq(1.5,end_point,by=2), par("usr")[3]-0.25,
srt = 45, adj= 1, xpd = TRUE,labels = paste(grouped_end$End.Station), cex=0.65)Interesting! The 2 most popular start stations are also the 2 most popular end stations. Also, both Clinton St & Madison St and Lake Shore Dr & Monroe St appear in the top 5, although in different positions.
Let’s check how many (if any) trips start and finish in the same station by adding a new column to our dataset:
df <- df %>%
mutate (Same.Station = identical(df$End.Station,df$Start.Station))
sum(df$Same.Station)## [1] 0
There are not trips starting and ending at the same station.
I’ve chosen the charts that compare trip duration vs age for both gender and user type since they are the most insigthful charts as described later in the reflection. Also the trip duration distribution using plotly shows how concentrated the trips are and how age have a low influence in this specific variable.
grouped_df <- df %>%
group_by(Birth.Year,Gender) %>%
summarise(mean_trip_duration = mean(Trip.Duration)) %>%
ungroup() %>%
arrange(Birth.Year)
ggplot(aes(x = Birth.Year, y = mean_trip_duration),
data = subset(grouped_df, Gender %in% c('Male','Female'))) +
geom_line(aes(color = Gender), stat = 'summary', fun.y = median)grouped_per_weekday <- df %>%
group_by(day) %>%
summarise(mean_trip_duration = round(mean(Trip.Duration),0))
ggplot(grouped_per_weekday, aes(x = factor(day,c('Sunday','Monday','Tuesday','Wednesday','Thursday','Friday','Saturday')),
y = mean_trip_duration)) + geom_col() +
geom_text(aes(label=mean_trip_duration), position=position_dodge(width=0.9), vjust=-0.25) + ggtitle("Average Trip Duration (in seconds) per Weekday") + xlab("Average Trip Duration") + ylab("Weekday")set.seed(1)
sample_df <- df[sample(nrow(df), 10000), ]
plot_ly(data = sample_df, x = sample_df$Trip.Duration , y = sample_df$Birth.Year,
marker = list(size = 10,
color = 'rgba(255, 182, 193, .9)',
line = list(color = 'rgba(152, 0, 0, .8)',width = 2)),
text = ~paste("Start Station: ", sample_df$Start.Station, '$<br>Finish Station:', sample_df$End.Station)
) %>%
layout(title = 'Trip duration distribution',
yaxis = list(zeroline = FALSE),
xaxis = list(zeroline = FALSE))## Warning: Ignoring 2053 observations
Based on the analysis of the bike sharing database, we can evaluate a number of strategic information such as most engaged users’ profile and most popular stations.
First, regarding which day of the week has the most number of bike rides, it is obviously the weekend.
We can see that the customer base of this bike sharing company is unbalanced in terms of gender: a lot more Male riders than Female. However, Female riders have a longer trip duration on average, which probably also translates in higher NPV. Also, there’s no big difference in trip duration when looking at user age.
Regarding the stations, we can see that 2 stations are the most popular for both starting and ending trips: Streeter Dr & Grand Ave and Clinton St & Washington Blvd. They must be located in very important spots of Chicago.